This page documents the data cleaning and preparation, variable selection, statistical modeling, and survival probability calibration procedures used in the research article: A Two-Stage Machine Learning Approach to Predict Heart Transplantation Survival Probabilities over Time with a Monotonic Probability Constraint. The data were provided from the registry of the United Network for Organ Sharing (UNOS) by UNOS staff.
The reader can show any code chunk by clicking on the code button. The code is hidden by default because we (a) wanted to improve the readability of this document and (b) assumed that readers will not be interested in reading every code chunk.
The snippet below documents the list of R packages and functions that were used in this research. For convenience, we used the pacman package, since it allows installing and loading the needed packages in one step.
rm(list = ls()) # clear global environment
graphics.off() # close all graphics
library(pacman) # needs to be installed first
# p_load is equivalent to combining both install.packages() and library()
p_load(haven, dplyr, caret, foreign, glmnet, lubridate, dataPreparation,
       httr, DT, stringr, AUC, snow, testit, caretEnsemble, C50,
       randomForest, kernlab, e1071, Boruta)
source("https://github.com/Ying-Ju/Heart_Transplant/raw/master/isotonic_paper_functions.R")
In the snippet below, we load the data file and the form that records information about the variables in the data. Both files are provided by UNOS, but we added one column, INTERPRETATION_TYPE, to the information form. It records the variable type of each variable in the data; this is a variable we created based on the code book provided by UNOS. The information can be found here.
# load data set
heart.df <- read_sas("thoracic_data.sas7bdat")
# add ids for each row
heart.df$ID <- row.names(heart.df)
# heart.form contains the variable definitions of each variable in data
heart.form <- read.csv("var_desc.csv")
Here is the information included in the form.
## [1] "VARIABLE.NAME" "DESCRIPTION"
## [3] "SAS.ANALYSIS.FORMAT" "SAS.SYSTEM.DATA.LENGTH"
## [5] "SAS.FORMAT" "FORM"
## [7] "Pre" "VAR.START.DATE"
## [9] "VAR.END.DATE" "FORM.SECTION"
## [11] "DATA.TYPE" "INTERPRETATION_TYPE"
## [13] "Delete.When..." "nas"
## [15] "VARIABLE.LABEL.IN.SAS.DATA" "COMMENT"
Since our focus is to study long-term survival prediction for adult patients after a heart transplant is performed, we drop some variables and observations based on the following criteria:
## In the following lines, we identify the variables that were retired (have an end date) or were added in 2000 or later, so that they can be dropped.
# First, we remove whitespace from the start and end of each element of VAR.END.DATE
var.end.dates <- trimws(heart.form$VAR.END.DATE)
# Second, identify the variables that are still in use (empty end date).
names.of.var.did.not.end <- heart.form[which(var.end.dates==""), 1]
# Make sure ID is included.
names.of.var.did.not.end <- c(as.character(names.of.var.did.not.end), "ID")
# Third, identify the variables that are not used now.
vars_ended <- heart.form[which(heart.form$VARIABLE.NAME %in% names(heart.df)[!(names(heart.df) %in% names.of.var.did.not.end)]),(1:2)]
# Figure out the date on which each variable was added
vars.added.dates <- heart.form$VAR.START.DATE %>% as.character()
# Fix an error in the form: two dates were recorded for one variable; we chose the later date.
vars.added.dates[which(vars.added.dates=="01-Oct-87, 01-Oct-90")] <- "01-Oct-90"
# Figure out the two-digit year in which each variable was added and add this information to the heart form
heart.form$YR_ADDED <- sapply(vars.added.dates, function(x) str_extract_all(x,"[0-9]{1,2}")[[1]][2]) %>% as.integer()
# Variables added before 2000 that are still in use: two-digit years 87-99
# correspond to 1987-1999, while 2000 and later extract as values below 87
vars.added.before.2000 <- subset(heart.form, YR_ADDED>=87, select = c(1))
vars.added.NA <- subset(heart.form,is.na(YR_ADDED),select = c(1))
vars.added.all <- rbind(vars.added.before.2000,vars.added.NA)
vars.added.all <- vars.added.all[["VARIABLE.NAME"]] %>%
as.character()
vars.added.all <- c(as.character(vars.added.all), "ID")
# Based on the above criteria, we subset the data: dropping retired variables and excluding patients who were under 18, implausibly light or short, or did not have a heart transplant.
heart.df.cleaned <- subset(heart.df, WL_ORG=="HR") %>% # Heart
subset(AGE>=18) %>% # Adults only
# we exclude implausibly light or short donors and recipients (below the 0.0001 quantile)
subset(WGT_KG_DON_CALC >= quantile(WGT_KG_DON_CALC, 0.0001, na.rm = TRUE)) %>%
subset(WGT_KG_TCR >= quantile(WGT_KG_TCR, 0.0001, na.rm = TRUE)) %>%
subset(HGT_CM_DON_CALC >= quantile(HGT_CM_DON_CALC, 0.0001, na.rm = TRUE)) %>%
subset(HGT_CM_TCR >= quantile(HGT_CM_TCR, 0.0001, na.rm = TRUE)) %>%
subset(select=intersect(names.of.var.did.not.end,vars.added.all))
# Find variables that are related to dates
vars_discarded <- heart.form %>%
subset(INTERPRETATION_TYPE=="D", select=c(1,2))
heart.discard <- vars_discarded$VARIABLE.NAME %>% as.character()
# Identify the variables that are related to dates in the data and remove them
heart.discard <- intersect(heart.discard, colnames(heart.df.cleaned))
heart.df.cleaned <- select(heart.df.cleaned, -heart.discard)
In this section, we create several variables based on the following references: 1. Medved, Dennis, et al. "Improving prediction of heart transplantation outcome using deep learning techniques." Scientific Reports 8.1 (2018): 3613. 2. Dag, Ali, et al. "Predicting heart transplantation outcomes through data analytics." Decision Support Systems 94 (2017): 42-52.
# ref1: Medved, Dennis, et al. "Improving prediction of heart transplantation outcome using deep learning techniques." Scientific reports 8.1 (2018): 3613.
# refer to a tool provided in the: http://ihtsa.cs.lth.se/ , which is product of this paper:
# https://www.nature.com/articles/s41598-018-21417-7.pdf
# ref2: Dag, Ali, et al. "Predicting heart transplantation outcomes through data analytics." Decision Support Systems 94 (2017): 42-52.
# create several variables based on references and obtain the subset of the data based on the criteria
heart.df.cleaned <- heart.df.cleaned %>%
#ref1
mutate(PVR = (HEMO_PA_MN_TRR- HEMO_PCW_TRR)*79.72/HEMO_CO_TRR) %>%
#ref1
mutate(ISCHTIME = ISCHTIME*60) %>%
#ref1
mutate(ECMO = ifelse(ECMO_TCR + ECMO_TRR == 0, 0, 1)) %>%
# PVR, pulmonary vascular resistance / its calculation is based on the below mentioned links:
# https://en.wikipedia.org/wiki/Vascular_resistance
# http://www.scymed.com/en/smnxph/phkhr013.htm
# https://radiopaedia.org/articles/mean-pulmonary-arterial-pressure (calculation of Mean Pulmonary Arterial Pressure)
# PVR= (Mean Pulmonary Arterial Pressure (mmHg) - Pulmonary Capillary Wedge Pressure (mmHg)) * 79.72 / Cardiac Output (L/min)
# PVR = (HEMO_PA_MN_TRR - HEMO_PCW_TRR)* 79.72 / HEMO_CO_TRR
# ECMO / merge of (ECMO_TCR, ECMO_TRR)
# The following variables are mutated by the authors
mutate(BMI_CHNG = 100*(BMI_CALC- INIT_BMI_CALC)/INIT_BMI_CALC) %>%
# mutate(WAITING_TIME = TX_DATE - INIT_DATE) %>% #no need, because it is already in there as "DAYSWAIT_CHRON"
mutate(WGT_CHNG = 100*(WGT_KG_CALC - INIT_WGT_KG_CALC)/INIT_WGT_KG_CALC) %>%
mutate(HGT_CHNG = 100*(HGT_CM_CALC - INIT_HGT_CM_CALC)/INIT_HGT_CM_CALC) %>%
mutate(AGE_MAT = abs(AGE - AGE_DON)) %>%
mutate(BMI_MAT = abs(BMI_CALC - BMI_DON_CALC))
In this section, we re-group levels in some categorical variables and develop some categorical variables based on the literature, considering the pool of patients.
The function we use to re-group levels in a categorical variable is cat_changer(), which takes four arguments. The usage of this function can be found in isotonic_paper_functions.R. At the end of this snippet, we remove two variables (DEATH_CIRCUM_DON and DEATH_MECH_DON) due to discrepancies.
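For readers who do not want to open isotonic_paper_functions.R, here is a minimal sketch of how such a re-grouping helper could behave; the name cat_changer_sketch and its internals are illustrative assumptions, not the actual implementation.

```r
# Illustrative sketch only: the real cat_changer() is defined in
# isotonic_paper_functions.R. This version maps each old level to its
# replacement and leaves unmatched values untouched.
cat_changer_sketch <- function(df, var, val_old, val_new) {
  x <- as.character(df[[var]])
  idx <- match(x, as.character(val_old))            # position of each value in val_old
  df[[var]] <- ifelse(is.na(idx), x, val_new[idx])  # unmatched values pass through
  df
}
toy <- data.frame(ABO = c("A1", "A2B", "O"), stringsAsFactors = FALSE)
toy <- cat_changer_sketch(toy, "ABO",
                          val_old = c("A","A1","A2","B","O","AB","A1B","A2B"),
                          val_new = c("A","A","A","B","O","AB","AB","AB"))
toy$ABO  # "A" "AB" "O"
```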
# We regroup Diagnosis variables (three variables: DIAG, TCR_DGN, THORACIC_DGN) as follows
val_old <- c(1000,1001,1002,1003,1004,1005,1006,1049,1007,1200)
val_new <- c("DILATED_MYOPATHY_IDI","DILATED_MYOPATHY_OTH","DILATED_MYOPATHY_OTH","DILATED_MYOPATHY_OTH","DILATED_MYOPATHY_OTH","DILATED_MYOPATHY_OTH","DILATED_MYOPATHY_OTH","DILATED_MYOPATHY_OTH","DILATED_MYOPATHY_ISC","CORONARY")
heart.df.cleaned <- cat_changer(heart.df.cleaned,var="DIAG",val_old,val_new)
heart.df.cleaned <- cat_changer(heart.df.cleaned,var="TCR_DGN",val_old,val_new)
heart.df.cleaned <- cat_changer(heart.df.cleaned,var="THORACIC_DGN",val_old,val_new)
# The variable: COD_CAD_DON is re-grouped
val_old <- c(1,2,3,4,999,"Unknown")
val_new <- c("ANOXIA","CEREBROVASCULAR_STROKE","HEAD_TRAUMA","OTHER","OTHER",NA)
heart.df.cleaned <- cat_changer(heart.df.cleaned,var="COD_CAD_DON",val_old,val_new)
# The variables regarding blood types are re-grouped.
val_old <- c("A","A1","A2","B","O","AB","A1B","A2B")
val_new <- c("A","A","A","B","O","AB","AB","AB")
heart.df.cleaned <- cat_changer(heart.df.cleaned,var="ABO",val_old,val_new)
heart.df.cleaned <- cat_changer(heart.df.cleaned,var="ABO_DON",val_old,val_new)
# The variable: DIAB is re-grouped.
val_old <- c(1,2,3,4,5,998)
val_new<-c("no","one","two","OTHER","OTHER",NA)
heart.df.cleaned <- cat_changer(heart.df.cleaned,var="DIAB",val_old,val_new)
# previous cardiac surgery /merge of (PRIOR_CARD_SURG_TCR, PRIOR_CARD_SURG_TRR)
heart.df.cleaned$CARD_SURG <- NA
for(i in 1:nrow(heart.df.cleaned)){
if(!is.na(heart.df.cleaned$PRIOR_CARD_SURG_TCR[i])){
if(heart.df.cleaned$PRIOR_CARD_SURG_TCR[i]=="Y"){
heart.df.cleaned$CARD_SURG[i] <- "Y"
}
if(heart.df.cleaned$PRIOR_CARD_SURG_TCR[i]=="N"){
if(!is.na(heart.df.cleaned$PRIOR_CARD_SURG_TRR[i])){
if(heart.df.cleaned$PRIOR_CARD_SURG_TRR[i]=="N"){
heart.df.cleaned$CARD_SURG[i] <- "N"
}
}
}
}
if(!is.na(heart.df.cleaned$PRIOR_CARD_SURG_TRR[i])){
if(heart.df.cleaned$PRIOR_CARD_SURG_TRR[i]=="Y"){heart.df.cleaned$CARD_SURG[i] <- "Y"
}
}
}
# In the data set, several variables are related to different types of antigen alleles. Each antigen has two allele types, so these variables record: 0: no match, 1: one allele matched, 2: both alleles matched. Instead of using these variables, we use the variables that record summaries of mismatches in these antigen alleles: HLAMIS, AMIS, BMIS, and DRMIS, and remove the following variables:
heart.df.cleaned[c("DA1","DA2","RA1","RA2","DB1","DB2","RB1","RB2","RDR1","RDR2","DDR1","DDR2")] <- NULL
# reference for HLAMIS:
# Weisdorf, Daniel, et al. "Classification of HLA-matching for retrospective analysis of unrelated donor transplantation: revised definitions to predict survival." Biology of Blood and Marrow Transplantation 14.7 (2008): 748-758.
# references for AMIS, BMIS, DRMIS:
# Parham, Peter. The Immune System. Garland Science, 2014.
val_old <- 0:6
val_new <- c("a","a","a","b","c","f","e")
heart.df.cleaned <- cat_changer(heart.df.cleaned,var="HLAMIS",val_old,val_new)
# the following block is our variable manipulation based on the other literature as specified in Ali Dag's paper.
heart.df.cleaned$ETH_MAT <- NA
for(i in 1:nrow(heart.df.cleaned)){
if(!is.na(heart.df.cleaned$ETHCAT[i])){
if(!is.na(heart.df.cleaned$ETHCAT_DON[i])){
if(heart.df.cleaned$ETHCAT_DON[i]==heart.df.cleaned$ETHCAT[i]){
heart.df.cleaned$ETH_MAT[i] <- "Y"
}else{
heart.df.cleaned$ETH_MAT[i]<-"N"
}
}
}
}
heart.df.cleaned$GENDER_MAT <- NA
for(i in 1:nrow(heart.df.cleaned)){
if(!is.na(heart.df.cleaned$GENDER[i])){
if(!is.na(heart.df.cleaned$GENDER_DON[i])){
if(heart.df.cleaned$GENDER[i]==heart.df.cleaned$GENDER_DON[i]){
heart.df.cleaned$GENDER_MAT[i]<-"Y"
}else{
heart.df.cleaned$GENDER_MAT[i]<-"N"
}
}
}
}
# PROC_TY_HR, from literature
val_old <- c(1,2)
val_new <- c("Bicaval","Traditional")
heart.df.cleaned <- cat_changer(heart.df.cleaned,var="PROC_TY_HR",val_old,val_new)
# The variable: SHARE_TY is regrouped.
# ALLOCATION TYPE-LOCAL/REGIONAL/NATIONAL - 3=LOCAL/4=REGIONAL/5=NATIONAL/6=FOREIGN
val_old <- c(3,4)
val_new <- c("LOCAL","REGIONAL")
heart.df.cleaned <- cat_changer(heart.df.cleaned,var="SHARE_TY",val_old,val_new)
# The variable: EDUCATION is regrouped.
val_old <- c(1,2,3,4,5,6,996,998)
val_new <- c("a","a","b","c","d","d",NA,NA)
heart.df.cleaned <- cat_changer(heart.df.cleaned,var="EDUCATION",val_old,val_new)
# The variables ETHCAT and ETHCAT_DON (ethnicities of recipients and donors) are re-grouped
val_old <- c(1,2,4,5,6,7,9,998)
val_new <- c("w","b","h","o","o","o","o",NA)
heart.df.cleaned <- cat_changer(heart.df.cleaned,var="ETHCAT",val_old,val_new)
heart.df.cleaned <- cat_changer(heart.df.cleaned,var="ETHCAT_DON",val_old,val_new)
# We dropped PRI_PAYMENT_CTRY_TCR and PRI_PAYMENT_CTRY_TRR because of too many NAs
heart.df.cleaned[c("PRI_PAYMENT_CTRY_TCR","PRI_PAYMENT_CTRY_TRR")] <- NULL
# Two variables: PRI_PAYMENT_TCR and PRI_PAYMENT_TRR are re-grouped.
val_old <- seq(1,14)
val_new<-c("pv","pbma","pbmcffs","pbmoth","pbmoth","pbmoth","pbmoth","OTHER","OTHER","OTHER","OTHER","OTHER","OTHER","OTHER")
heart.df.cleaned <- cat_changer(heart.df.cleaned,var="PRI_PAYMENT_TCR",val_old,val_new)
heart.df.cleaned <- cat_changer(heart.df.cleaned,var="PRI_PAYMENT_TRR",val_old,val_new)
# The variable: REGION is re-grouped.
val_old <- seq(1,11)
val_new <- c("NE","NE","SE","SE","W","W","MW","MW","NE","MW","SE")
heart.df.cleaned <- cat_changer(heart.df.cleaned,var="REGION",val_old,val_new)
# here we re-group two variables: FUNC_STAT_TRR and FUNC_STAT_TCR, based on their activity level and hospitalization status
# https://www.communitycarenc.org/media/tool-resource-files/what-does-it-take-qualify-personal-care-services-d.pdf
# http://www.npcrc.org/files/news/karnofsky_performance_scale.pdf
val_old <- c(1,2,3,996,998,2010,2020,2030,2040,2050,2060,2070,2080,2090,2100)
val_new <- c("A","B","B",NA,NA,"C","C","C","C","D","D","D","E","E","E")
heart.df.cleaned <- cat_changer(heart.df.cleaned,var="FUNC_STAT_TRR",val_old,val_new)
heart.df.cleaned <- cat_changer(heart.df.cleaned,var="FUNC_STAT_TCR",val_old,val_new)
# Due to discrepancies (in particular, the frequencies of natural causes of death are not consistent), we drop these two variables
heart.df.cleaned[c("DEATH_CIRCUM_DON","DEATH_MECH_DON")] <- NULL
In this section, we re-group levels in some categorical variables and develop some categorical variables based on their definitions from the code book provided by UNOS and their corresponding distributions. In addition, we drop categorical variables in which more than 90% of the observations are NA, variables in which more than 90% of the observations fall in the same level (category), and numerical variables in which more than 30% of the observations are NA. Note that we do not impute values for the numerical variables in the data.
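The NA-rate screens in this section rely on the helper col_missing_function(), presumably defined in isotonic_paper_functions.R. As a hedged sketch (the name col_missing_sketch and the exact return format are our assumptions), the per-column NA proportion can be computed as follows.

```r
# Illustrative sketch: proportion of missing values per column, with the
# column names as row names (mirroring how NA_Col_Rate is used in this section).
col_missing_sketch <- function(df) {
  data.frame(na_count_col = colMeans(is.na(df)))
}
toy <- data.frame(a = c(1, NA, NA, NA), b = 1:4)
col_missing_sketch(toy)  # a: 0.75, b: 0.00
```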
# We drop variables that are related to identifiers or dates or not related to the goal of the study.
# We drop the malignancy variables (MALIG_TY, MALIG_TY_TCR) because their category levels are not well distinguished and there are too many NAs.
heart.df.cleaned[ c("WL_ID_CODE", "WL_ORG","INIT_DATE","TX_DATE","CTR_CODE","DATA_TRANSPLANT",
"DATA_WAITLIST","DISTANCE", "DON_RETYP","ECD_DONOR","END_OPO_CTR_CODE","HOME_STATE_DON",
"INIT_OPO_CTR_CODE", "INOTROP_VASO_CO_TRR","INOTROP_VASO_DIA_TRR","INOTROP_VASO_MN_TRR",
"INOTROP_VASO_PCW_TRR","INOTROP_VASO_SYS_TCR","INOTROP_VASO_SYS_TRR","LISTING_CTR_CODE","LOS",
"MALIG_TY","MALIG_TY_TCR","OPO_CTR_CODE","ORGAN","OTH_LIFE_SUP_OSTXT_TCR","OTH_LIFE_SUP_OSTXT_TRR",
"PERM_STATE","PRIOR_CARD_SURG_TYPE_OSTXT_TCR","PRIOR_CARD_SURG_TYPE_OSTXT_TRR","PT_CODE",
"TRR_ID_CODE")] <- NULL
# Here the NA equivalent characters are changed to NA
{
NA_cells <- c(""," ","U")
for(i in 1:length(NA_cells)){
heart.df.cleaned[heart.df.cleaned == NA_cells[i]] <- NA
gc()}
}
# Two variables: BRONCHO_LT_DON & BRONCHO_RT_DON are re-grouped.
val_old <- c(1,2,3,4,5,6,7,998)
val_new <- c("NO","BNOR","BAPS","BAOTH","BAOTH","BAOTH","BAOTH",NA)
heart.df.cleaned <- cat_changer(heart.df.cleaned,var="BRONCHO_LT_DON",val_old,val_new)
heart.df.cleaned <- cat_changer(heart.df.cleaned,var="BRONCHO_RT_DON",val_old,val_new)
# The variable: CHEST_XRAY_DON is re-grouped.
val_old <- c(0,1,2,3,4,5,998,999)
val_new <- c(NA,NA,"NOR","AB","AB","ABboth",NA,NA)
heart.df.cleaned <- cat_changer(heart.df.cleaned,var="CHEST_XRAY_DON",val_old,val_new)
# Nine variables: CMV_DON, EBV_SEROSTATUS, HBV_CORE, HBV_CORE_DON, HBV_SUR_ANTIGEN, HCV_SEROSTATUS, HEP_C_ANTI_DON, HTLV2_OLD_DON, HIV_SEROSTATUS are re-grouped.
val_old <- c("C","I","N","ND","P","U")
val_new <- c(NA,NA,"Neg",NA,"POS",NA)
heart.df.cleaned <- cat_changer(heart.df.cleaned,var="CMV_DON",val_old,val_new)
heart.df.cleaned <- cat_changer(heart.df.cleaned,var="EBV_SEROSTATUS",val_old,val_new)
heart.df.cleaned <- cat_changer(heart.df.cleaned,var="HBV_CORE",val_old,val_new)
heart.df.cleaned <- cat_changer(heart.df.cleaned,var="HBV_CORE_DON",val_old,val_new)
heart.df.cleaned <- cat_changer(heart.df.cleaned,var="HBV_SUR_ANTIGEN",val_old,val_new)
heart.df.cleaned <- cat_changer(heart.df.cleaned,var="HCV_SEROSTATUS",val_old,val_new)
heart.df.cleaned <- cat_changer(heart.df.cleaned,var="HEP_C_ANTI_DON",val_old,val_new)
heart.df.cleaned <- cat_changer(heart.df.cleaned,var="HTLV2_OLD_DON",val_old,val_new)
# In the code book, the SAS ANALYSIS FORMAT of the variable HIV_SEROSTATUS is not specified; however, after checking its values, we believe it also uses SERSTAT.
heart.df.cleaned <- cat_changer(heart.df.cleaned,var="HIV_SEROSTATUS",val_old,val_new)
# The variable: CORONARY_ANGIO is re-grouped.
val_old <- c(1,2,3)
val_new <- c("NO", "YES","YES")
heart.df.cleaned <- cat_changer(heart.df.cleaned,var="CORONARY_ANGIO",val_old,val_new)
# Two variable: HIST_DIABETES_DON and HYPERTENS_DUR_DON are re-grouped.
val_old <- c(1,2,3,4,5,998)
val_new <- c("NO","YES","YES","YES","YES",NA)
heart.df.cleaned <- cat_changer(heart.df.cleaned,var="HIST_DIABETES_DON",val_old,val_new)
heart.df.cleaned <- cat_changer(heart.df.cleaned,var="HYPERTENS_DUR_DON",val_old,val_new)
# Two variables: END_STAT, INIT_STAT are re-grouped.
# although 2999 has a low frequency, we do not merge it, based on the literature: Huang, Edmund, et al. "Incidence of conversion to active waitlist status among temporarily inactive obese renal transplant candidates." Transplantation 98.2 (2014): 177-186.
val_old <- c(2010,2020,2030,2090,2999)
val_new <- c("ONE","ONE","TWO","ONE","OTHER")
heart.df.cleaned <- cat_changer(heart.df.cleaned,var="END_STAT",val_old,val_new)
heart.df.cleaned <- cat_changer(heart.df.cleaned,var="INIT_STAT",val_old,val_new)
# The variable: LAST_INACT_REASON definition is re-grouped
# We got its definition from: https://www.srtr.org/requesting-srtr-data/saf-data-dictionary/
# 1: Candidate cannot be contacted
# 2: Candidate choice
# 3: Candidate work-up incomplete
# 4: Insurance issues
# 5: Medical non-compliance
# 6: Inappropriate substance use
# 7: Temporarily too sick
# 8: Temporarily too well
# 9: Weight currently inappropriate for transplant
# 10: TX'ed - removal pending UNET data correction
# 11: Inactivation due to VAD implantation and/or VAD complication
# 12: TX Pending
# 13: Physician/Surgeon unavailable
# 14: Candidate for living donor transplant only
# We regroup this variable based on if it's inactive due to health issues
val_old <- seq(1,14)
val_new <- c("ONE", "ONE","ONE","ONE","ONE", "ONE","TWO","ONE","TWO", "ONE","ONE","ONE","ONE","ONE")
heart.df.cleaned <- cat_changer(heart.df.cleaned,var="LAST_INACT_REASON",val_old,val_new)
# The variable: NUM_PREV_TX is re-grouped.
val_old <- seq(0,10)
val_new <- c("ZERO", "MORE", "MORE", "MORE", "MORE", "MORE", "MORE", "MORE", "MORE", "MORE", "MORE")
heart.df.cleaned <- cat_changer(heart.df.cleaned,var="NUM_PREV_TX",val_old,val_new)
# Two variables: PRIOR_CARD_SURG_TYPE_TCR and PRIOR_CARD_SURG_TYPE_TRR are re-grouped.
#search for procurement in this form: All_Forms_eg_RH.pdf
# CABG: Coronary artery bypass graft
val_old <- seq(1,31)
val_new <- c("CABG","VALV", "CABG","OTHER", "CABG","VALV", "CABG","OTHER", "CABG","VALV",
"CABG","OTHER", "CABG","VALV", "CABG","OTHER", "CABG","VALV", "CABG","OTHER",
"CABG","VALV", "CABG","OTHER", "CABG","VALV", "CABG","OTHER", "CABG","VALV", "CABG")
heart.df.cleaned <- cat_changer(heart.df.cleaned,var="PRIOR_CARD_SURG_TYPE_TCR",val_old,val_new)
heart.df.cleaned <- cat_changer(heart.df.cleaned,var="PRIOR_CARD_SURG_TYPE_TRR",val_old,val_new)
# We create indicator variables for "HEPARIN", "ANCEF", "DOPAMINE", and "ZOSYN", since these are the most frequently repeated medicines in the free-text fields.
temp <- heart.df.cleaned[c("PT_OTH1_OSTXT_DON", "PT_OTH2_OSTXT_DON","PT_OTH3_OSTXT_DON", "PT_OTH4_OSTXT_DON")]
new_vars <- data.frame(matrix(NA, ncol=4, nrow=nrow(heart.df.cleaned)))
colnames(new_vars) <- c("HEPARIN","ANCEF","DOPAMINE","ZOSYN")
temp <- as.data.frame(apply(temp, 2, function(x) gsub("^\\s*$", NA, x)), stringsAsFactors=FALSE) # blank or whitespace-only entries become NA
for (i in 1:ncol(new_vars)){
new_vars[,i] <- apply(temp, 1, function(x) detect_terms(x, colnames(new_vars)[i]))
}
# We remove PT_OTH1_OSTXT_DON, PT_OTH2_OSTXT_DON, PT_OTH3_OSTXT_DON, and PT_OTH4_OSTXT_DON and use the four variables we just created: HEPARIN, ANCEF, DOPAMINE, ZOSYN.
heart.df.cleaned[c("PT_OTH1_OSTXT_DON","PT_OTH2_OSTXT_DON","PT_OTH3_OSTXT_DON","PT_OTH4_OSTXT_DON")]<-NULL
heart.df.cleaned<-cbind(heart.df.cleaned,new_vars)
rm(list=c("new_vars")) #remove the unnecessary object
# The variable: STERNOTOMY_TRR is re-grouped.
val_old <- c(1,2,3,998)
val_new <- c("ONE", "MORE","MORE", NA)
heart.df.cleaned <- cat_changer(heart.df.cleaned,var="STERNOTOMY_TRR",val_old,val_new)
# We drop the variable TX_YEAR from the dataset since we want a more versatile model
heart.df.cleaned$TX_YEAR<-NULL
#===================================================================
# Here we drop columns in which more than 90% of the data is NA.
NA_Col_Rate <- col_missing_function(heart.df.cleaned)
NA_Col_Rate$varname <- rownames(NA_Col_Rate)
NA_Col_Rate <- NA_Col_Rate[which(NA_Col_Rate$na_count_col>0.9),]
NA_Col_Rate <- NA_Col_Rate$varname
heart.df.cleaned[NA_Col_Rate] <- NULL
#===================================================================
# Here we drop variables that have more than 90% of the observations in one level / category.
cat_dis <- vector(mode="numeric", length=ncol(heart.df.cleaned))
for(i in 1:ncol(heart.df.cleaned)){
struct <- as.data.frame(table(heart.df.cleaned[i]))
max_cat <- max(struct$Freq)
all_freq <- sum(struct$Freq)
if((max_cat/all_freq)>0.9) cat_dis[i] <- 1
}
heart.df.cleaned[(cat_dis==1)]<-NULL
#===================================================================
# Here we define categorical and numerical variables
initial_num <- heart.form$VARIABLE.NAME[which(heart.form$INTERPRETATION_TYPE=="NUM")]
mutated_num <- c("BMI_CHNG" ,"WGT_CHNG" ,"HGT_CHNG" ,"AGE_MAT","BMI_MAT","PVR")
pool_num <- c(as.character(initial_num),mutated_num)
# the numerical variables in the cleaned dataset are saved in pool_num_clean
pool_num_clean <- pool_num[which(pool_num %in% names(heart.df.cleaned))]
initial_char <- heart.form$VARIABLE.NAME[which(heart.form$INTERPRETATION_TYPE=="CHAR")]
mutated_char <- c("GENDER_MAT","ETH_MAT" ,"CARD_SURG" ,"HEPARIN" ,"ANCEF" ,"DOPAMINE","ZOSYN")
pool_char <- c(as.character(initial_char),mutated_char)
# the categorical variables in the cleaned dataset are saved in pool_char_clean
pool_char_clean <- pool_char[which(pool_char %in% names(heart.df.cleaned))]
#===================================================================
# Here we drop the numerical variables in which more than 30% of the observations are NA.
# We decided not to impute numerical values, so a more conservative approach is adopted.
NA_Col_Rate <- col_missing_function(heart.df.cleaned[pool_num_clean])
NA_Col_Rate$varname <- rownames(NA_Col_Rate)
NA_Col_Rate <- NA_Col_Rate[which(NA_Col_Rate$na_count_col>0.3),]
NA_Col_Rate <- NA_Col_Rate$varname
heart.df.cleaned[NA_Col_Rate] <- NULL
# We updated the numerical variables used in the cleaned dataset
pool_num_clean <- pool_num[which(pool_num %in% names(heart.df.cleaned))]
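The medicine flags (HEPARIN, ANCEF, DOPAMINE, ZOSYN) created earlier in this section depend on detect_terms(), presumably defined in isotonic_paper_functions.R. Here is a hedged sketch of the intended behavior; the name detect_terms_sketch and the exact NA handling are our assumptions.

```r
# Illustrative sketch: return "Y" if any free-text field mentions the term,
# "N" if at least one field is filled but none mention it, NA if all fields
# are missing.
detect_terms_sketch <- function(fields, term) {
  if (all(is.na(fields))) return(NA_character_)
  hit <- grepl(term, fields, ignore.case = TRUE)
  if (any(hit, na.rm = TRUE)) "Y" else "N"
}
detect_terms_sketch(c("heparin drip", NA, NA, NA), "HEPARIN")  # "Y"
detect_terms_sketch(c("insulin", NA, NA, NA), "HEPARIN")       # "N"
```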
Here is a report about the remaining variables ("discarded" means irrelevant or uninteresting variables).
rem_type2 <- as.data.frame(table(heart.form[which(heart.form$VARIABLE.NAME %in% names(heart.df.cleaned)), "INTERPRETATION_TYPE"]))
names(rem_type2) <- c("Variable Type","Frequency")
rem_type2$`Variable Type` <- c("Categorical","Initially Discarded","Date","Numerical")
cat("Here are the new variables that we developed from the dataset")
## Here are the new variables that we developed from the dataset
str(heart.df.cleaned[,c("PVR","BMI_CHNG","WGT_CHNG","HGT_CHNG","AGE_MAT","BMI_MAT","GENDER_MAT","ETH_MAT","CARD_SURG","HEPARIN", "ANCEF", "DOPAMINE", "ZOSYN")])
## 'data.frame': 48442 obs. of 13 variables:
## $ PVR : num NA NA NA NA NA ...
## $ BMI_CHNG : num 0 0 0 0 0 ...
## $ WGT_CHNG : num 0 0 0 0 0 ...
## $ HGT_CHNG : num 0 0 0 0 0 ...
## $ AGE_MAT : num 28 26 19 32 28 24 14 6 43 19 ...
## $ BMI_MAT : num 5.197 9.388 2.725 0.249 5.64 ...
## $ GENDER_MAT: chr "Y" "N" "Y" "N" ...
## $ ETH_MAT : chr "N" "Y" "N" "Y" ...
## $ CARD_SURG : chr NA NA NA NA ...
## $ HEPARIN : chr NA NA NA NA ...
## $ ANCEF : chr NA NA NA NA ...
## $ DOPAMINE : chr NA NA NA NA ...
## $ ZOSYN : chr NA NA NA NA ...
cat("Number of patients: ",nrow(heart.df.cleaned),"& Number of variables: ",ncol(heart.df.cleaned))
## Number of patients: 48442 & Number of variables: 124
In this section, we apply one-hot encoding to create dummy variables for each categorical variable. Before the encoding is applied, we impute “UNKNOWN” for NA values in the categorical variables and remove the corresponding row when an NA occurs in a numerical variable.
# Here we create 11 consecutive dependent variables: month 1 (stored as year0) and years 1-10 (year1, ..., year10)
{
p_unit <- c(1/12,seq(1,10))
predict_length <- 365
heart.df.cleaned$GSTATUS <- as.character(heart.df.cleaned$GSTATUS)
heart.df.cleaned <- heart.df.cleaned[complete.cases(heart.df.cleaned$GTIME),]
Responses <- as.data.frame(matrix(c(NA), ncol=11, nrow=nrow(heart.df.cleaned)))
for (i in 1:11){
Responses[,i] <- sapply(seq(1:nrow(heart.df.cleaned)), function(x, df) class_generator_bino(df$GSTATUS[x], df$GTIME[x], p_unit[i], predict_length), df=heart.df.cleaned)
Responses[,i] <- as.factor(Responses[,i])
}
colnames(Responses) <- paste("year", seq(0,10), sep="")
heart.df.cleaned <- cbind(heart.df.cleaned, Responses)
heart.df.cleaned$GSTATUS <- NULL
heart.df.cleaned$GTIME <- NULL
}
#####################
# We updated numerical and categorical variables used here:
pool_num_clean <- pool_num[which(pool_num %in% names(heart.df.cleaned))]
pool_char_clean <- pool_char[which(pool_char %in% names(heart.df.cleaned))]
# We dropped the numerical NAs and replaced the NA (missing values) in categorical variables to "UNKNOWN"
heart.df.cleaned <- heart.df.cleaned[complete.cases(heart.df.cleaned[pool_num_clean]),]
heart.df.cleaned_char <- heart.df.cleaned[pool_char_clean]
heart.df.cleaned_char[is.na(heart.df.cleaned_char)] <- "UNKNOWN"
heart.df.cleaned_num <- heart.df.cleaned[pool_num_clean]
heart.df.cleaned <- cbind(heart.df.cleaned_num,heart.df.cleaned_char,heart.df.cleaned[c(paste("year", seq(0,10), sep=""), "ID")])
pool_char_clean <- c(pool_char_clean,paste("year", seq(0,10), sep=""))
# remove unnecessary datasets
rm(list=c("heart.df.cleaned_char","heart.df.cleaned_num"))
# Make sure the type of each variable is recorded correctly
for(i in names(heart.df.cleaned)){
if(i %in% pool_char_clean){heart.df.cleaned[i] <- as.character(heart.df.cleaned[,i])}
if(i %in% pool_num_clean){heart.df.cleaned[i] <- as.numeric(heart.df.cleaned[,i])}
}
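The labels year0 through year10 are produced by class_generator_bino(), presumably defined in isotonic_paper_functions.R. As a hedged sketch of the labeling logic (the name class_generator_sketch, the 1/0 coding, and the censoring rule are our assumptions):

```r
# Illustrative sketch: binary survival label at a given time point.
# gstatus: "1" = graft failure observed, "0" = censored; gtime: follow-up in days.
class_generator_sketch <- function(gstatus, gtime, p_unit, predict_length = 365) {
  cutoff <- p_unit * predict_length                 # time point in days
  if (is.na(gtime)) return(NA)
  if (gtime >= cutoff) return(1)                    # survived past the time point
  if (!is.na(gstatus) && gstatus == "1") return(0)  # failed before the time point
  NA                                                # censored before the time point
}
class_generator_sketch("0", 400, 1/12)  # 1: alive at one month
class_generator_sketch("1", 100, 1)     # 0: failure within the first year
class_generator_sketch("0", 100, 1)     # NA: censored before one year
```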
In this snippet, we create the training and holdout datasets at each time point of interest. In order both to validate models on the holdout datasets and to use isotonic regression to calibrate survival probabilities, we make sure all patients selected in the 10th-year holdout dataset were included at previous time points. This procedure is not necessary for either prediction or the calibration of survival probabilities.
We use a list object keep_NA to record the following IDs: 1. IDs for the dataset at each time point of interest; 2. IDs for the holdout dataset at each time point of interest; 3. IDs for the training dataset at each time point of interest.
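As a toy illustration of the nesting described above (all IDs here are fabricated): the year-10 holdout is drawn first and then embedded in each earlier holdout, which is topped up to 20% of that year's available patients.

```r
# Toy example of the nested holdout construction with fabricated IDs.
set.seed(1)
ids_y10 <- c("p1", "p2")                  # holdout drawn at year 10
ids_y1  <- paste0("p", 1:20)              # patients with a defined year-1 label
n_hold  <- round(0.2 * length(ids_y1))    # 20% of 20 = 4
extra   <- sample(setdiff(ids_y1, ids_y10), n_hold - length(ids_y10))
holdout_y1 <- c(ids_y10, extra)           # contains all year-10 holdout IDs
```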
# We keep IDs
keep_NA <- rep(list(NA), 33)
for (i in 1:11){
keep_NA[[i]] <- heart.df.cleaned[!is.na(heart.df.cleaned[paste0("year", (i-1))]),"ID"]
}
names(keep_NA)[1:11] <- paste0("ID",seq(0,10))
all_sizes <- unname(unlist(lapply(keep_NA, length)))
test_sizes <- round(all_sizes*0.2)
train_sizes <- all_sizes - test_sizes
# We select the holdout sets that contain the year 10 test data
# Then we exclude those IDs from each year to find IDs for other training sets
set.seed(2019)
keep_NA[[12]] <- sample(keep_NA$ID10,test_sizes[11])
names(keep_NA)[12] <- "ID_holdout10"
for (i in 2:11){
keep_NA[[(11+i)]] <- c(keep_NA$ID_holdout10, sample(setdiff(keep_NA[[(12-i)]], keep_NA$ID_holdout10), (test_sizes[(12-i)]-length(keep_NA$ID_holdout10))))
}
names(keep_NA)[13:22] <- paste0("ID_holdout", seq(9,0))
for (i in 1:11){
keep_NA[[(22+i)]] <- setdiff(keep_NA[[i]], keep_NA[[paste0("ID_holdout",(i-1))]])
}
names(keep_NA)[23:33] <- paste0("ID_train", seq(0,10))
# One-hot encoding is applied to the independent categorical variables using the function dummy_maker().
exclud <- c(paste0("year",seq(0,10)), "ID")
var_ind_char <- pool_char_clean[!pool_char_clean %in% exclud]
heart.df.cleaned.dum <- dummy_maker(heart.df.cleaned,var_ind_char)
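The actual dummy_maker() is defined in isotonic_paper_functions.R; a minimal base-R sketch of one-hot encoding (the helper name and the column-naming scheme are our assumptions) might look like this.

```r
# Illustrative sketch: replace each categorical column with 0/1 indicator
# columns, one per level.
dummy_maker_sketch <- function(df, char_vars) {
  for (v in char_vars) {
    f <- factor(df[[v]])
    m <- stats::model.matrix(~ f - 1)       # one indicator column per level
    colnames(m) <- paste(v, levels(f), sep = "_")
    df <- cbind(df, as.data.frame(m))
    df[[v]] <- NULL                         # drop the original column
  }
  df
}
toy <- data.frame(ABO = c("A", "O", "AB"), AGE = c(45, 60, 52))
names(dummy_maker_sketch(toy, "ABO"))  # "AGE" "ABO_A" "ABO_AB" "ABO_O"
```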
In our study, three feature selection methods are adopted: random forest, fast correlation-based feature selection, and LASSO. This section summarizes the results of these variable selection algorithms. To use the code provided here, you will need some experience with parallel computation using the parSapply() function from the R package snow. The random forest selection algorithm takes a long time to find the important features in the data; this part was conducted at the Ohio Supercomputer Center.
# Here, we use the function parSapply() in snow package to perform the parallel computation for three variable selection algorithms
# for checking the definition of the feature selection algorithms, check this paper:
# Fonti, Valeria, and Eduard Belitser. "Feature Selection using LASSO."
features <- rep(list(NA), 4)
names(features) <- c("FFS", "LASSO", "RF", "all")
#=============================================================
#==============Fast Feature selection
#=============================================================
cl <- makeCluster(4, type="SOCK")
features$FFS <- parSapply(cl, 0:10, FFS_bin, heart.df.cleaned.dum, keep_NA, exclud)
stopCluster(cl)
#=============================================================
#==============Lasso Feature selection for Binomial TARGETS
#=============================================================
cl <- makeCluster(4, type="SOCK")
features$LASSO <- parSapply(cl, 0:10, Lasso_bin, heart.df.cleaned.dum, keep_NA, exclud, folds=5,trace=F,alpha=1)
stopCluster(cl)
#=============================================
#==============Random Forest Feature Selection
#=============================================
cl <- makeCluster(11, type="SOCK")
RF_result <- parSapply(cl, 0:10, RF_bin, heart.df.cleaned.dum, keep_NA, exclud)
stopCluster(cl)
# The fourth option: we use all variables selected by the previous three selection methods
features$all <- sapply(1:11, function(x) all_bin(x,features))
# Note: the merge did not check for duplicates;
# years 4, 5, and 6 each contain COD_CAD_DON_HEAD_TRAUMA twice
Here are the variables selected using the Fast Feature Selection algorithm; the number of variables selected at each time point (month 1, then years 1-10) is reported below.
4 variables are selected.
3 variables are selected.
6 variables are selected.
2 variables are selected.
3 variables are selected.
6 variables are selected.
6 variables are selected.
6 variables are selected.
5 variables are selected.
5 variables are selected.
7 variables are selected.
The variables selected using the LASSO Feature Selection algorithm are reported in the following.
33 variables are selected.
27 variables are selected.
24 variables are selected.
32 variables are selected.
28 variables are selected.
44 variables are selected.
36 variables are selected.
35 variables are selected.
27 variables are selected.
32 variables are selected.
45 variables are selected.
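The per-year selections above come from `Lasso_bin()`, defined in `isotonic_paper_functions.R`. As a rough sketch of the underlying idea (not the exact implementation), LASSO selection with `glmnet` keeps the predictors whose coefficients are nonzero at the cross-validated penalty; the function name `lasso_select` and its arguments are illustrative:

```r
library(glmnet)
# x: numeric model matrix of predictors; y: binary survival target
lasso_select <- function(x, y, folds = 5) {
  # alpha = 1 gives the pure LASSO penalty; folds matches our 5-fold CV
  cv_fit <- cv.glmnet(x, y, family = "binomial", nfolds = folds, alpha = 1)
  coefs  <- coef(cv_fit, s = "lambda.min")       # coefficients at the best lambda
  selected <- rownames(coefs)[as.vector(coefs != 0)]
  setdiff(selected, "(Intercept)")               # drop the intercept term
}
```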
The variables selected in this subsection are from the Random Forest algorithm. The Boruta() function in the R package Boruta is used.
100 variables are selected.
110 variables are selected.
115 variables are selected.
109 variables are selected.
99 variables are selected.
97 variables are selected.
90 variables are selected.
90 variables are selected.
84 variables are selected.
83 variables are selected.
70 variables are selected.
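Our `RF_bin()` wrapper is defined in `isotonic_paper_functions.R`; at its core, Boruta-based selection looks roughly like the sketch below, where `formul` and `traindata` stand for the per-year formula and training frame used in the modeling section:

```r
library(Boruta)
# formul: TARGET ~ . ; traindata: the per-year training data frame
boruta_out <- Boruta(formul, data = traindata, doTrace = 0)
# keep only the attributes confirmed important; set withTentative = TRUE
# to also keep attributes whose importance remained undecided
selected <- getSelectedAttributes(boruta_out, withTentative = FALSE)
```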
We combine all variables selected from the above three selection algorithms.
118 variables are selected.
120 variables are selected.
121 variables are selected.
117 variables are selected.
108 variables are selected.
113 variables are selected.
103 variables are selected.
101 variables are selected.
90 variables are selected.
97 variables are selected.
92 variables are selected.
In this section, we use the following machine learning algorithms to model the training data: Logistic Regression, Random Forest, Bagging, and Boosting. The holdout dataset corresponding to each year is used to validate the performance of each model. Here is the outline for the whole procedure.
We repeat the above steps (1-3) 500 times to obtain 500 bootstrap samples, and the model with the median AUC value across these samples is the final model used to validate the performance on the holdout dataset. The variable M_AUC_index records where the median model occurs in each year and allows us to extract the model information without saving all 500 models.
The study in this section was conducted at the Ohio Supercomputer Center on the Owens cluster, using the batch system with multiple cores.
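The median-AUC selection in the loop below keeps only the index of the middle model rather than all 500 fitted objects. A minimal sketch of that bookkeeping, using simulated AUC values in place of the real bootstrap results:

```r
set.seed(2018)
aucs <- runif(500, min = 0.7, max = 0.9)    # stand-in for 500 bootstrap AUCs
sorted <- sort.int(aucs, index.return = TRUE)
m <- floor((length(aucs) + 1) / 2)          # position of the median value
M_AUC <- sorted$x[m]                        # the median AUC itself
M_AUC_index <- sorted$ix[m]                 # which bootstrap sample achieved it
```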
# First, we make sure the type of each variable is correct
feature_names <- colnames(heart.df.cleaned.dum)
for (i in 1:length(feature_names)){
if (feature_names[i]%in%pool_num_clean){
heart.df.cleaned.dum[,feature_names[i]] <- as.numeric(heart.df.cleaned.dum[,feature_names[i]])
}else if(feature_names[i]!="ID"){
heart.df.cleaned.dum[,feature_names[i]] <- factor(heart.df.cleaned.dum[,feature_names[i]])
}
}
# Create a matrix to save the performance measures (AUC, Sensitivity, Specificity, Accuracy) for the holdout sets
Performance <- matrix(NA, ncol=11, nrow=4)
rownames(Performance) <- c("auc","sen","spec","accu")
colnames(Performance) <- paste("Year", 0:10, sep="")
# Create a matrix to save the survival probabilities for the holdout data
# Since our goal is calibrating survival probabilities using isotonic regression, we only save the survival probabilities for individuals whose survival status is known for all years in the holdout sets
Prob <- matrix(NA, ncol=11, nrow=length(keep_NA$ID_holdout10))
colnames(Prob) <- paste("Year", 0:10, sep="")
# S is used to save the total number of models created for each time point
S <- rep(NA, 11)
# M_AUC is used to save the median AUC value for each time point
M_AUC <- rep(NA, 11)
# Save all survival probabilities for holdout sets, so we can use this object to plot ROC curves
P <- rep(list(NA), 11)
# Index is used to save where the median model occurs
Index <- rep(NA, 11)
# The following packages are used for parallel computing
library(Rmpi)
library(snow)
# Set up a timer to track the time used
time.begin <- proc.time()[3]
# assign cores used in the parallel computing
workers <- as.numeric(Sys.getenv(c("PBS_NP"))) - 1
cl1 <- makeCluster(workers,"MPI")
total_samples <- 500
for (time_point in 0:10){
assigned_seed <- 2018 + 500*time_point
variables <- c(as.character(features$LASSO[[(time_point+1)]][[1]]), paste("year",time_point,sep=""))
ID_train_name <- paste("ID_train",time_point,sep="")
ID_holdout_name <- paste("ID_holdout",time_point,sep="")
traindata <- heart.df.cleaned.dum[heart.df.cleaned.dum$ID %in% keep_NA[[ID_train_name]], variables]
hold_out <- heart.df.cleaned.dum[heart.df.cleaned.dum$ID %in% keep_NA[[ID_holdout_name]], variables]
TARGET <- variables[length(variables)]
formul <- as.formula(paste(TARGET,"~.",sep=""))
# we perform the parallel computing here to obtain 500 bootstrap samples, corresponding models and evaluate performance of these models
# the usage of the function pred_func() can be found in the file: isotonic_paper_functions.R
Allsamples <- parSapply(cl1, 1:total_samples, pred_func, traindata, hold_out, TARGET, formul, pool_num_clean, assigned_seed, "log", 5,1)
na_index <- which(is.na(unlist(Allsamples["AUC",])))
if (length(na_index)!=0){
Allsamples <- Allsamples[,-na_index]
total_samples <- ncol(Allsamples)
}
All_AUC <- sort.int(unlist(Allsamples["AUC",]), index.return = T)
M_AUC[(time_point+1)] <- All_AUC$x[(floor((total_samples+1)/2))]
M_AUC_index <- All_AUC$ix[(floor((total_samples+1)/2))]
Performance[,(time_point+1)] <- as.matrix(unname(Allsamples[1,M_AUC_index]$`Performance`))
temp_prob <- data.frame(cbind(Allsamples[2,M_AUC_index]$`Predicted`$Probability, ID=keep_NA[[ID_holdout_name]]), stringsAsFactors=F)
P[[(time_point+1)]] <- Allsamples[2,M_AUC_index]$`Predicted`
Prob[,(time_point+1)] <- as.numeric(temp_prob$V1[temp_prob$ID %in% keep_NA$ID_holdout10])
S[(time_point+1)] <- total_samples
Index[(time_point+1)] <- M_AUC_index
rm(Allsamples)
}
stopCluster(cl1)
time.end <- proc.time()[3] - time.begin
paste("It took", time.end, "seconds to run the program.")
Performance <- rbind(Performance, S)
# we export the result
saveRDS(P, file="ROC_log.rds")
write.csv(Performance, "Performance_LASSO_log.csv")
write.csv(Prob, "Probability_LASSO_log.csv", row.names = F)
mpi.quit()
In this section, isotonic regression is used to calibrate the survival probabilities for each patient whose survival status is available in the 10th year after the transplant in the holdout data set.
Prob <- read.csv("Probability_LASSO_log.csv")
survivals <- matrix(NA, ncol=11, nrow=nrow(Prob))
survival_Probability <- matrix(NA, ncol=11, nrow=nrow(Prob))
for (i in 1:11){
survivals[,i] <- Prob[,(12-i)]
}
# apply isotonic regression to the probability matrix
survivals_isotonic <- t(apply(survivals, 1, function(x) isoreg(x)$yf))
for (i in 1:11){
survival_Probability[,i] <- survivals_isotonic[,(12-i)]
}
colnames(survival_Probability) <- colnames(Prob)
# the survival probability matrix after isotonic regression is applied
survival_Probability <- as.data.frame(survival_Probability)
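To see the monotonic constraint in action: `isoreg()` replaces a noisy sequence with its non-decreasing least-squares fit, and because the columns are reversed before and after the call (years 10 down to 0), the calibrated survival probabilities become non-increasing over time. A small sketch with made-up probabilities for one patient:

```r
# made-up raw survival probabilities for one patient, years 0 through 10
raw <- c(0.95, 0.93, 0.94, 0.88, 0.90, 0.85, 0.86, 0.80, 0.82, 0.78, 0.75)
rev_fit <- isoreg(rev(raw))$yf   # non-decreasing fit on the reversed series
calibrated <- rev(rev_fit)       # back to years 0..10: now non-increasing
all(diff(calibrated) <= 0)       # TRUE: probabilities never increase over time
```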
In this section, we report the results obtained. (Will update later.)
In this section, we examine how well our models perform by comparing the survival probabilities before and after applying isotonic regression.
We report the performance measures using the median model for the holdout datasets.
Year 0:

| | Logistic Regression | Random Forest | Bagging | Boosting |
|---|---|---|---|---|
| AUC | 0.9021312 | 0.9010579 | 0.8938306 | 0.8957242 |
| Sensitivity | 0.9673660 | 0.9184149 | 0.8344988 | 0.8671329 |
| Specificity | 0.7452086 | 0.7767756 | 0.8021421 | 0.8010147 |
| Accuracy | 0.7617806 | 0.7873413 | 0.8045557 | 0.8059468 |
Year 1:

| | Logistic Regression | Random Forest | Bagging | Boosting |
|---|---|---|---|---|
| AUC | 0.9209175 | 0.9225588 | 0.9136045 | 0.9103617 |
| Sensitivity | 0.9090909 | 0.8813747 | 0.8758315 | 0.8680710 |
| Specificity | 0.8125000 | 0.8409498 | 0.8434140 | 0.8490143 |
| Accuracy | 0.8287365 | 0.8477451 | 0.8488632 | 0.8522177 |
Year 2:

| | Logistic Regression | Random Forest | Bagging | Boosting |
|---|---|---|---|---|
| AUC | 0.8464109 | 0.8352355 | 0.8208192 | 0.8316776 |
| Sensitivity | 0.7841012 | 0.7163505 | 0.7145438 | 0.7362240 |
| Specificity | 0.7697639 | 0.8375257 | 0.8074949 | 0.8121150 |
| Accuracy | 0.7729362 | 0.8107136 | 0.7869278 | 0.7953228 |
Year 3:

| | Logistic Regression | Random Forest | Bagging | Boosting |
|---|---|---|---|---|
| AUC | 0.8090103 | 0.8017194 | 0.7789347 | 0.7890432 |
| Sensitivity | 0.7271306 | 0.6333073 | 0.6387803 | 0.6684910 |
| Specificity | 0.7401296 | 0.8426635 | 0.7919859 | 0.7949322 |
| Accuracy | 0.7365718 | 0.7853627 | 0.7500535 | 0.7603253 |
Year 4:

| | Logistic Regression | Random Forest | Bagging | Boosting |
|---|---|---|---|---|
| AUC | 0.7763218 | 0.7635770 | 0.7399907 | 0.7547391 |
| Sensitivity | 0.6776224 | 0.5804196 | 0.5846154 | 0.6384615 |
| Specificity | 0.7206838 | 0.8259829 | 0.7911111 | 0.7569231 |
| Accuracy | 0.7065442 | 0.7453502 | 0.7233065 | 0.7180253 |
Year 5:

| | Logistic Regression | Random Forest | Bagging | Boosting |
|---|---|---|---|---|
| AUC | 0.7593099 | 0.7509916 | 0.7196709 | 0.7318693 |
| Sensitivity | 0.6522310 | 0.5400262 | 0.5314961 | 0.5892388 |
| Specificity | 0.7187500 | 0.8347656 | 0.7945312 | 0.7488281 |
| Accuracy | 0.6939275 | 0.7247796 | 0.6963761 | 0.6892752 |
Year 6:

| | Logistic Regression | Random Forest | Bagging | Boosting |
|---|---|---|---|---|
| AUC | 0.7514782 | 0.7340287 | 0.7108762 | 0.7296377 |
| Sensitivity | 0.6433437 | 0.5331269 | 0.5467492 | 0.5832817 |
| Specificity | 0.7221970 | 0.8170676 | 0.7639582 | 0.7544258 |
| Accuracy | 0.6888423 | 0.6969618 | 0.6720796 | 0.6820325 |
Year 7:

| | Logistic Regression | Random Forest | Bagging | Boosting |
|---|---|---|---|---|
| AUC | 0.7618822 | 0.7406594 | 0.7127086 | 0.7299925 |
| Sensitivity | 0.6352801 | 0.5601907 | 0.5655542 | 0.5989273 |
| Specificity | 0.7471264 | 0.7993730 | 0.7544410 | 0.7293626 |
| Accuracy | 0.6948775 | 0.6876392 | 0.6662027 | 0.6684298 |
Year 8:

| | Logistic Regression | Random Forest | Bagging | Boosting |
|---|---|---|---|---|
| AUC | 0.7678170 | 0.7547415 | 0.7344363 | 0.7498412 |
| Sensitivity | 0.6462156 | 0.5688073 | 0.5900229 | 0.6060780 |
| Specificity | 0.7490887 | 0.8280680 | 0.7606318 | 0.7752126 |
| Accuracy | 0.6961652 | 0.6946903 | 0.6728614 | 0.6882006 |
Year 9:

| | Logistic Regression | Random Forest | Bagging | Boosting |
|---|---|---|---|---|
| AUC | 0.7770059 | 0.7603541 | 0.7345934 | 0.7439735 |
| Sensitivity | 0.6309255 | 0.5914221 | 0.6156885 | 0.6252822 |
| Specificity | 0.7813370 | 0.8008357 | 0.7277159 | 0.7437326 |
| Accuracy | 0.6982544 | 0.6851621 | 0.6658354 | 0.6783042 |
Year 10:

| | Logistic Regression | Random Forest | Bagging | Boosting |
|---|---|---|---|---|
| AUC | 0.7667817 | 0.7534881 | 0.7449029 | 0.7365267 |
| Sensitivity | 0.5935557 | 0.5754664 | 0.6178632 | 0.6048615 |
| Specificity | 0.8088924 | 0.8049922 | 0.7433697 | 0.7246490 |
| Accuracy | 0.6840380 | 0.6719108 | 0.6705998 | 0.6551950 |